Drew Conway’s Venn Diagram of Data Science
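library(dplyr)      # mutate(), group_by(), summarise(), %>%
library(lubridate)  # mdy(), month()
# Load the click data and parse visit_date from month/day/year text
sample.data <- read.csv("click_data.csv")
sample.data <- sample.data %>% mutate(visit_date = mdy(visit_date))
# Average rate of clicking "adopt today", by calendar month
summarized.data <- sample.data %>%
  group_by(month(visit_date)) %>%
  summarise(avg = mean(clicked_adopt_today))
summarized.data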
## # A tibble: 12 x 2
##    `month(visit_date)`   avg
##                  <dbl> <dbl>
##  1                   1 0.197
##  2                   2 0.189
##  3                   3 0.145
##  4                   4 0.15
##  5                   5 0.258
##  6                   6 0.333
##  7                   7 0.348
##  8                   8 0.542
##  9                   9 0.293
## 10                  10 0.161
## 11                  11 0.233
## 12                  12 0.465
# Compare two web pages: 50 of 200 visitors engaged with the new page, 30 of 200 with the old
result <- prop.test(x = c(50, 30), n = c(200, 200), correct = FALSE)
result$p.value
## [1] 0.01241933
Remember we are tied to a decision here: does the new webpage generate more donations/clicks?
If the new webpage were really no better than the old one, a difference this large would show up only about 1.2% of the time, so switching is unlikely to be a waste of time
The test does not make the decision for you; you must factor in cost. If the new webpage costs twice as much as the old one, that should be accounted for.
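To show where that p-value comes from, here is a minimal sketch that redoes the test by hand as a pooled two-proportion z-test and compares it with `prop.test()`. The variable names are illustrative; the counts are the ones used above.

```r
# Engagement counts passed to prop.test() above
x <- c(50, 30)    # engagements per page
n <- c(200, 200)  # visitors per page

p_hat  <- x / n                # observed rates: 0.25 and 0.15
p_pool <- sum(x) / sum(n)      # pooled rate under "no difference"
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- (p_hat[1] - p_hat[2]) / se
p_manual <- 2 * pnorm(-abs(z)) # two-sided p-value

# Same test via prop.test(), without the continuity correction
p_builtin <- prop.test(x = x, n = n, correct = FALSE)$p.value

print(c(z = z, manual = p_manual, builtin = p_builtin))
```

With `correct = FALSE`, the chi-squared statistic reported by `prop.test()` is the square of this z statistic, which is why the two p-values agree.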
result$conf.int
## [1] 0.02221634 0.17778366
## attr(,"conf.level")
## [1] 0.95
Out of 200 people, we would expect about 4 to 36 more to engage with the new webpage
If each engagement is expected to bring in 1 dollar (on average) and the new webpage costs an additional 50 dollars a month to maintain, switching may not be worth it
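The cost comparison can be sketched as a few lines of arithmetic. The interval, the 1 dollar per engagement, and the 50 dollar monthly cost come from the text; treating 200 visitors as one month of traffic is an assumption made only for illustration, as is the break-even calculation at the end.

```r
ci <- c(0.02221634, 0.17778366)  # 95% interval for the difference, from result$conf.int
visitors <- 200                  # ASSUMPTION: one month of traffic
value_per_engagement <- 1        # dollars, as stated in the text
extra_cost <- 50                 # dollars per month, as stated in the text

# Extra engagements implied by each end of the interval
extra_engagements <- ci * visitors
net_gain <- extra_engagements * value_per_engagement - extra_cost

# Hypothetical: monthly visitors needed before the extra revenue covers the cost
break_even_visitors <- extra_cost / (ci * value_per_engagement)

print(round(extra_engagements, 1))
print(round(net_gain, 1))
print(round(break_even_visitors))
```

Under these assumptions the net gain is negative at both ends of the interval, so the decision hinges on how much traffic the page actually gets.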
Assumes the metric we are interested in is positive tweets that mention WPAOG
Hypothesis: If I do X, then people will tweet more positively about WPAOG
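library(readr)     # read_csv()
library(dplyr)     # %>%, anti_join(), select()
library(tidytext)  # unnest_tokens(), stop_words
# Read the tweets and split each tweet's text into one word per row
added = read_csv("AOGTweets.csv")
textcleaned = added %>%
  unnest_tokens(word, text)
# Drop common stop words and keep the columns used downstream
cleanedarticle = anti_join(textcleaned, stop_words, by = "word") %>%
  select(line, word, screenname, time, followerscount, favoritescount, searchterm)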
## # A tibble: 10 x 2
##    dayofweek  AverageDailySentiment
##    <date>                     <dbl>
##  1 2019-11-06                0.221
##  2 2019-11-07                0.0952
##  3 2019-11-08                0.177
##  4 2019-11-09                0.218
##  5 2019-11-10                0.210
##  6 2019-11-11                0.143
##  7 2019-11-12                0.0115
##  8 2019-11-13                0.198
##  9 2019-11-14                0.192
## 10 2019-11-15                0.108
| screenname | text | followerscount | favoritescount | polaritysent | influencescore |
|---|---|---|---|---|---|
| WPAOG | Today we salute all veterans. Thank you to all the brave men and women and members of the #LongGrayLine who served and continue to serve our great nation. #DutyHonorCountry https://t.co/0nNNDVzVyT | 15353 | 95 | 0.1793366 | 17.036979 |
| WPAOG | #OTD 12 November-Today is the birthday of Rebecca E. Marier McGuigan (1995), the first woman to be rated the top all-around West Point graduate–in academic, military training & physical trainingsince women first graduated in 1980. She is an active duty Army doctor. #WPAOG150 https://t.co/2IGfg8zSYY | 15353 | 164 | 0.2392355 | 39.234618 |
| WPAOG | This weekend, WPAOG proudly welcomes the Class of 1994 back to West Point for their 25th reunion! We expect more than 380 graduates, family members and guests to return to USMA for the festivities! #WithCourageWeSoar https://t.co/6w5brxAgPp | 15353 | 32 | 0.2930217 | 9.376695 |
| screenname | text | followerscount | favoritescount | polaritysent | influencescore |
|---|---|---|---|---|---|
| WPAOG | #OTD 11 Nov 1918, Armistice Day (now Veterans Day in the U.S.) ended World War I, which at the time was The War to End All Wars. Thirty-two West Point graduates were killed in the fighting. #WPAOG150 https://t.co/vTtLOK1wNn | 15353 | 6 | -0.2836056 | -1.7016336 |
| sailingtobyzant | Congratulations to West Point grads in gov. service, except William B Taylor, Jr., Ambassador to Ukraine, who is dishonorably engaged in dishonoring his Commander, President Trump, with triple hearsay to benefit Deep State Anti-American Socialist/Liberal/Communists Democrats. https://t.co/4BWVfz5NVP | 2823 | 1 | -0.1092247 | -0.1092247 |
| AEVanSaun | Spent the morning sharing a few stories about my old roommate Dennis, and then going on a run. Can’t believe it’s been 14 years. #StrongGrayLine @WestPoint_USMA @WPAOG https://t.co/jUO3ltYEwf | 634 | 23 | -0.0057540 | -0.1323428 |
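The source does not say how `influencescore` is computed, but the values in the two tables are consistent with `polaritysent` multiplied by `favoritescount`. The sketch below checks that inferred relationship on the six rows shown; the data frame is retyped by hand from the tables, and the `recomputed` and `max_gap` names are illustrative.

```r
# Values retyped from the two tables above
tweets <- data.frame(
  screenname     = c("WPAOG", "WPAOG", "WPAOG", "WPAOG", "sailingtobyzant", "AEVanSaun"),
  favoritescount = c(95, 164, 32, 6, 1, 23),
  polaritysent   = c(0.1793366, 0.2392355, 0.2930217, -0.2836056, -0.1092247, -0.0057540),
  influencescore = c(17.036979, 39.234618, 9.376695, -1.7016336, -0.1092247, -0.1323428)
)

# ASSUMPTION (inferred, not stated in the source): score = sentiment x favorites
tweets$recomputed <- tweets$polaritysent * tweets$favoritescount

# Largest disagreement between the table and the recomputed score
max_gap <- max(abs(tweets$recomputed - tweets$influencescore))
print(tweets[, c("screenname", "influencescore", "recomputed")])
print(max_gap)
```

The small remaining gaps are what one would expect from the sentiment values being rounded to seven decimals in the tables.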
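library(lubridate)  # ymd(), decimal_date()
# Monthly series starting January 2004, split into trend, seasonal and random parts
gah <- decompose(ts(trends$Response, frequency = 12,
                    start = decimal_date(ymd("2004-01-01"))))
plot(gah)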
## # A tibble: 5 x 3
##   Response dayofweek    err
##      <dbl> <date>     <dbl>
## 1       90 2005-11-01  20.7
## 2       90 2006-09-01  23.4
## 3       87 2016-12-01  22.1
## 4      100 2017-12-01  33.6
## 5       93 2018-12-01  26.5
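The `err` column above looks like the leftover (random) component of the decomposition, although the source does not say so. A minimal sketch of that idea follows, using R's built-in `AirPassengers` series as a stand-in because the `trends` data is not available here. The `leftover` and `top5` names are illustrative.

```r
# Stand-in monthly series shipped with R (the source's `trends` data is not available)
series <- AirPassengers
parts  <- decompose(series)  # additive decomposition by default

# Trend, seasonal and random components add back up to the observed series
recombined <- parts$trend + parts$seasonal + parts$random

# Rank months by the size of the leftover (random) component
leftover <- data.frame(
  time  = as.numeric(time(series)),
  value = as.numeric(series),
  err   = as.numeric(parts$random)
)
leftover <- leftover[!is.na(leftover$err), ]  # the moving average leaves NAs at both ends
top5 <- head(leftover[order(-abs(leftover$err)), ], 5)
print(top5)
```

Months with a large leftover are the ones that neither the long-run trend nor the usual seasonal pattern explains, which makes them natural candidates for a closer look.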
82/(5*10)=1.64
Transaction data from 01/12/2010 to 09/12/2011 for a UK-based, registered, non-store online retailer; 541,909 transactions
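library(arules)  # read.transactions(), apriori(), inspect()
# Each line of the CSV is one basket of comma-separated items
tr <- read.transactions('market_basket_transactions.csv', format = 'basket', sep = ',')
# Keep rules with support >= 3%, confidence >= 85% and at most 5 items
association.rules <- apriori(tr, parameter = list(supp = 0.03, conf = 0.85, maxlen = 5),
                             control = list(verbose = FALSE))
inspect(association.rules[1:6])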
## lhs rhs support confidence lift count
## [1] {BLUE HAPPY BIRTHDAY BUNTING} => {PINK HAPPY BIRTHDAY BUNTING} 0.03521610 0.8555556 19.691287 154
## [2] {CANDLEHOLDER PINK HANGING HEART} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.04070432 0.9128205 5.238536 178
## [3] {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.05694032 0.9154412 11.914358 249
## [4] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.04962268 0.8714859 10.384218 217
## [5] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.04962268 0.9475983 12.332878 217
## [6] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {GREEN REGENCY TEACUP AND SAUCER} 0.04504917 0.9336493 12.151334 197
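The columns in this output are tied together by the standard definitions: support is count divided by the number of baskets, confidence is support(lhs and rhs) / support(lhs), and lift is confidence / support(rhs). The sketch below uses those definitions to back out quantities the output does not print, for the first three rules. The numbers are retyped from the output; the variable names are illustrative.

```r
# Metrics for rules [1] to [3], retyped from the inspect() output above
rules <- data.frame(
  support    = c(0.03521610, 0.04070432, 0.05694032),
  confidence = c(0.8555556, 0.9128205, 0.9154412),
  lift       = c(19.691287, 5.238536, 11.914358),
  count      = c(154, 178, 249)
)

# support = count / number of baskets, so each rule implies the basket total
n_baskets <- rules$count / rules$support

# confidence = support(lhs & rhs) / support(lhs)
lhs_support <- rules$support / rules$confidence

# lift = confidence / support(rhs)
rhs_support <- rules$confidence / rules$lift

print(round(n_baskets))
print(round(cbind(lhs_support, rhs_support), 4))
```

A lift well above 1, as in all six rules shown, means the right-hand item turns up far more often in baskets containing the left-hand items than it does across all baskets.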
Start with the question: why do you want to use data?
Data that is not accessible might as well not exist
Databasing is boring; cleaning data is boring; but both must be done
Don’t try to fix yesterday; find the question you want answered today and start collecting relevant data
Don’t focus on sledgehammers; smaller tools are better most of the time
Buzz words: Big Data, Deep Learning, AI/ML
People are by nature storytellers; use data to tell the story, don’t let data be the story